Introduction

The purpose of this project was a preliminary exploration of the red wine data, with the goal of selecting the right attributes for classifying wine into one of three quality categories: poor, normal and excellent. Since the data was already in a tidy structure, little attention was paid to data wrangling; the focus was on finding relationships among the attributes and between the attributes and wine quality.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
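
The listing above has the shape of what names(), str() and summary() print for a data frame. As a minimal sketch, the same three calls can be tried on a ten-row excerpt copied from the str() output; the excerpt name `wine_head` and the file name in the comment are assumptions, not taken from the original analysis.

```r
# First ten rows of four columns, copied from the str() output above.
# The full data would be loaded with something like
#   redWine <- read.csv("wineQualityReds.csv")   # file name is an assumption
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

print(names(wine_head))    # column names
str(wine_head)             # type and leading values of each column
print(summary(wine_head))  # min, quartiles, median, mean and max per column
```

Because only ten rows are used, the summary statistics will differ from the full-data values printed above.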

Univariate Plots Section

I found the range of Total Sulfur Dioxide surprisingly wide, so I chose to plot its histogram.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The value of 289 for total sulfur dioxide is an outlier; the middle half of the wines have values between 22 and 62 (the first and third quartiles). The distribution is positively skewed, though it does not have a pronounced long tail.
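
One way to back up the words "outlier" and "positively skewed" with numbers is to compare the mean with the median and to apply the common 1.5 x IQR rule. The sketch below uses base R on the ten total.sulfur.dioxide values visible in the str() output; the 1.5 x IQR fence is my choice of rule, not something the original analysis states it used.

```r
# Ten total.sulfur.dioxide values copied from the str() output above.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# A mean above the median is a quick indication of positive (right) skew.
right_skewed <- mean(tsd) > median(tsd)

# Upper fence of the 1.5 * IQR rule: values above it are flagged as outliers.
q3          <- quantile(tsd, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * IQR(tsd)
outliers    <- tsd[tsd > upper_fence]

print(right_skewed)
print(upper_fence)
print(outliers)
```

Applied to the full-data quartiles reported above (22 and 62), the same rule gives an upper fence of 62 + 1.5 * 40 = 122, so the maximum of 289 lies far beyond it.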

Next, fixed acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity has a close to normal distribution, with a slight positive skew (mean 8.32 versus median 7.90).

Let’s observe volatile acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity appears to have a bimodal distribution, although the two modes are very close to each other. The middle half of the values lie between 0.39 and 0.64.

Let’s summarize citric acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distribution of citric acid is not normal. Most of the values are between 0.0 and 0.6.

Next, residual sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar appears to have a roughly normal central peak with a long right tail. The middle half of the values lie between 1.9 and 2.6, while the maximum is 15.5.

Next, chlorides.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides have a roughly normal central peak with a long right tail. The middle half of the values lie between 0.07 and 0.09.

Next, alcohol.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of alcohol is positively skewed. The middle half of the values lie between 9.50 and 11.10.

Next, density.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density is very close to normally distributed. The middle half of the values lie between 0.9956 and 0.9978.

Next, sulphates.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates have a mostly normal distribution that is slightly positively skewed, with a few outliers. The middle half of the values lie between 0.55 and 0.73.

Analysing the pH variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH has a close to normal distribution. The middle half of the values lie between 3.21 and 3.40, and the full range is 2.74 to 4.01, so every wine in this dataset is acidic.

Univariate Analysis

The dataset describes the quality of red wine. Each observation records the chemical characteristics of one wine. There are 1599 observations of 13 variables, one of which (X) is only a row index. The variable ‘quality’ holds the rating given after tasting the wine. The remaining variables are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol, all of which are continuous numeric variables. The quality variable is an integer that can be treated as a factor with 11 possible levels (0 to 10), of which 6 levels (3 to 8) are present in the dataset.
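
Treating quality as a factor, and grouping it into the poor, normal and excellent categories mentioned in the introduction, could look like the sketch below. The cut points (up to 4 is poor, 5 to 6 is normal, 7 and above is excellent) are purely hypothetical; the original text does not say where the boundaries lie.

```r
# Ten quality ratings copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor with all 11 possible levels (0 to 10), observed or not.
quality_factor <- factor(quality, levels = 0:10, ordered = TRUE)

# Hypothetical three-way grouping; the breaks are an assumption.
# cut() uses right-closed intervals by default: (-Inf,4], (4,6], (6,Inf).
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("poor", "normal", "excellent"))

print(nlevels(quality_factor))              # possible levels
print(nlevels(droplevels(quality_factor)))  # levels actually observed
print(table(rating))
```

Keeping the unobserved levels in `quality_factor` makes the 0 to 10 scale explicit, while droplevels() shows how few of those levels actually occur.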

The main feature of interest in this dataset is quality, together with the other variables that play a significant role in determining it. Some of these are alcohol, volatile.acidity, total.sulfur.dioxide and chlorides.

Color, flavor and aroma can significantly affect the perception of quality, but they are not recorded in this dataset. The variable citric.acid has an unusual distribution: contrary to my expectation, it is not close to a normal distribution in any way. I didn’t perform any transformation on any variable.

Bivariate Plots

Checking correlation between variables in dataset

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

This table of correlation between variables gives us some idea about the pairs of variables which have significant correlation indicating a possible linear relationship.
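
The table above also includes X, which is only a row index, so its correlations carry no chemical meaning. A minimal sketch of building the matrix without it, and of ranking variables by the absolute size of their correlation with quality, is shown here on a ten-row excerpt taken from the str() output; the name `wine_head` is an assumption.

```r
# Ten-row excerpt (values copied from the str() output), including the index X.
wine_head <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Drop the row index before computing Pearson correlations.
cor_matrix <- cor(wine_head[, names(wine_head) != "X"])
print(round(cor_matrix, 2))

# Rank the other variables by the absolute value of their correlation with quality.
quality_cor <- cor_matrix["quality", colnames(cor_matrix) != "quality"]
ranked      <- quality_cor[order(-abs(quality_cor))]
print(round(ranked, 2))
```

With only ten rows the coefficients will not match the full-data table; the point is the shape of the calls.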

Factors which are highly correlated with the density of wine

Intuitively it feels that the density of wine should be determined by the amount of alcohol in it. Let’s analyse them in a scatter plot.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

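As evident in the scatter plot, the density of wine is negatively correlated with the amount of alcohol in it (r = -0.50): wines with more alcohol tend to be less dense. The relationship is moderate rather than strictly proportional.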

## 
##  Pearson's product-moment correlation
## 
## data:  chlorides and density
## t = 8.1842, df = 1597, p-value = 5.541e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1531171 0.2472220
## sample estimates:
##       cor 
## 0.2006323
## [1] "Summary of Chlorides:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The graph shows that for the bulk of chloride values (0.07 to 0.09, the interquartile range), density is positively correlated with chlorides, although the overall correlation is weak (r = 0.20).

## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3116908 0.3973835
## sample estimates:
##       cor 
## 0.3552834
## [1] "Summary of residual sugar:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The graph shows that for the bulk of residual.sugar values (1.9 to 2.6, the interquartile range), density is positively correlated with residual sugar (r = 0.36).

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and density
## t = 15.665, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3216809 0.4066925
## sample estimates:
##       cor 
## 0.3649472
## [1] "Summary of citric acid:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The graph shows that for the bulk of citric.acid values (0.09 to 0.42, the interquartile range), density is positively correlated with citric acid (r = 0.36).

Exploring effect of different chemical properties in deciding the perceived quality of wine

The presence of sulfur dioxide in wine can be detected by smell when it is present in excess. Let’s look at the relation between total sulfur dioxide and the quality of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

As evident in the graph, the wines with the highest total sulfur dioxide have received a rating of 5. Overall, the correlation with quality is weakly negative (r = -0.19).

The relationship between fixed acidity and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

Wines with higher fixed acidity seem to have received slightly higher quality ratings, but the correlation is weak (r = 0.12).

The relationship between volatile acidity and quality.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Based on the graph and the correlation coefficient (r = -0.39), higher volatile acidity is associated with lower perceived quality of the wine.

Next, whether chlorides have any significant relationship with the quality of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  chlorides and quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

It looks like the perceived quality of wine is low for very high values of chlorides.

Next, how the quantity of sulphates affects the quality of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  sulphates and quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

In general, wines with a higher amount of sulphates received higher quality ratings. However, a large number of samples with extremely high sulphate values were rated 5.

The variable descriptions in the txt file accompanying the data mention that citric acid is added for freshness.

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

There seems to be a weak positive relationship between citric acid and quality (r = 0.23): higher values of citric acid tend to go with slightly higher quality ratings.

And finally, let’s see if amount of alcohol in the wine affects the perceived quality of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

There is a significant positive correlation (r = 0.48) between the amount of alcohol and the perceived quality of wine.

## 
##  Pearson's product-moment correlation
## 
## data:  density and quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

As density decreases, the perceived quality of wine tends to increase slightly (r = -0.17). This is consistent with alcohol being a major determinant of density: wines with more alcohol are less dense and tend to be rated higher.
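
The Pearson tests reported in this section have the form printed by cor.test(). A minimal sketch of such a call is shown below on ten alcohol and quality values copied from the str() output, together with the relation between the coefficient r and the t statistic. Wrapping the call in with() is an assumption on my part, suggested only by the bare variable names on the "data:" lines.

```r
# Ten rows of alcohol and quality copied from the str() output.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson's product-moment correlation test (the default method).
result <- with(wine_head, cor.test(alcohol, quality))
print(result)

# The t statistic follows from r and the degrees of freedom (n - 2).
r        <- unname(result$estimate)
df       <- unname(result$parameter)
t_manual <- r * sqrt(df / (1 - r^2))

# Same formula applied to the full-data values printed above for
# alcohol and quality (r = 0.4761663, df = 1597).
t_full <- 0.4761663 * sqrt(1597 / (1 - 0.4761663^2))
print(round(t_full, 3))
```

The last line reproduces the reported t of about 21.639, which shows that with 1597 degrees of freedom even modest coefficients give very small p-values. A tiny p-value here therefore indicates that a correlation is unlikely to be zero, not that it is strong.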

Bivariate Analysis

I looked at the relationship between quality and each of the other variables in the dataset, drawing scatter plots and calculating correlation coefficients for these pairs.

I found a strong correlation between pH and the total acidity of the wine. By total acidity, I mean fixed plus volatile acidity combined. I also observed that the density of wine has a fairly strong negative correlation with its alcohol content (r = -0.50).

The negative correlation between density and alcohol content is one of the stronger relationships in this dataset, although the correlation table shows somewhat larger coefficients for pairs such as pH and fixed acidity (-0.68) and citric acid and fixed acidity (0.67). I also observed that, of all the variables, alcohol content has the strongest correlation with quality (r = 0.48).
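
The total acidity described here is a derived column, so a short sketch may help. It adds fixed and volatile acidity and correlates the sum with pH, using ten rows copied from the str() output; the column name `total.acidity` is my own label for illustration.

```r
# Ten rows copied from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Derived variable: fixed plus volatile acidity.
wine_head$total.acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity

# A negative coefficient means pH falls as total acidity rises.
acidity_ph_cor <- with(wine_head, cor(total.acidity, pH))
print(acidity_ph_cor)
```

Even on this small excerpt the coefficient is negative, in line with lower pH meaning a more acidic wine.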

Multivariate Analysis

Having observed the relationship between the quality of wine and the other variables in the dataset, I found that the amount of alcohol, volatile acidity, chlorides and total sulfur dioxide have a great impact on the perceived quality of wine.

As we can see, most of the highly rated wines are those with higher alcohol and lower volatile acidity, while many of the mediocre to low rated wines have lower alcohol and higher volatile acidity. Some of the best rated wines have high alcohol and a medium level of volatile acidity.

It’s visible that wines with relatively low total sulfur dioxide are getting higher quality ratings. Wines with low alcohol are again getting lower quality ratings.

Most of the blue dots are concentrated in the lower left region. It’s clearly visible here that wines with low volatile acidity and low total sulfur dioxide have higher perceived quality than other samples.

Training a linear model with the dataset to predict the perceived quality of wine.

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + chlorides, 
##     data = redWine)
## 
## Coefficients:
##      (Intercept)           alcohol  volatile.acidity         chlorides  
##           3.1574            0.3106           -1.3821           -0.3343

As the model suggests, the percentage of alcohol present in wine positively contributes to the quality of wine while volatile.acidity and chlorides contribute negatively to quality as evident by the coefficients.
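
To make the coefficients concrete, the sketch below plugs the first wine from the str() output (alcohol 9.4, volatile acidity 0.7, chlorides 0.076) into the fitted equation, and then repeats the lm() call from the output on a ten-row excerpt. The helper `predict_quality` and the excerpt `wine_head` are hypothetical names introduced for illustration.

```r
# Coefficients as printed in the model output above.
coefs <- c(intercept = 3.1574, alcohol = 0.3106,
           volatile.acidity = -1.3821, chlorides = -0.3343)

# Hypothetical helper: evaluates the fitted linear equation by hand.
predict_quality <- function(alcohol, volatile.acidity, chlorides) {
  unname(coefs["intercept"] +
           coefs["alcohol"] * alcohol +
           coefs["volatile.acidity"] * volatile.acidity +
           coefs["chlorides"] * chlorides)
}

# First wine in the dataset; its recorded quality is 5.
first_wine <- predict_quality(9.4, 0.7, 0.076)
print(round(first_wine, 2))

# Same formula refitted on a ten-row excerpt, so these coefficients
# will differ from the full-data ones above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  chlorides        = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                       0.065, 0.073, 0.071),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
fit <- lm(quality ~ alcohol + volatile.acidity + chlorides, data = wine_head)
print(coef(fit))
```

The hand calculation gives roughly 5.08 for a wine that was rated 5. Because the predictors are on different scales, the raw coefficient sizes are not directly comparable: volatile acidity spans about 1.5 units in the data, while alcohol spans about 6.5.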

Final Plots:

Plot 1

The density of wine largely depends on the alcohol content as depicted in the plot below. As we can see, the density of wine reduces as the percentage of alcohol in wine increases.

Plot 2

The pH of wine is dependent on the fixed and volatile acidity of wine.

As the total acidity increases, pH of the wine decreases.

Plot 3

Alcohol, volatile acidity, chlorides and total sulfur dioxide have a great impact on the perceived quality of wine.

From the plot, the perceived quality of wine appears to increase with alcohol content and to decrease with volatile acidity, total sulfur dioxide and chlorides.

Reflections

The major success and learning from this project was that I could understand through scatter plots, which of the various chemical properties of wine affect the perceived quality of wine.

I didn’t encounter any major difficulties while dealing with this data. I think the pH level provides the balance needed for a wine to taste good.

Having information about the flavor, color and type of aroma (if it can be classified) could have significantly enriched our analysis, as I believe that these properties strongly influence the perceived quality of wine.